Mortgage Probability of Default & Mortgage Fraud¶
Data Analyst: Frankie Ma
Introduction: Mortgage fraud and default pose significant risks to both lenders and borrowers in the U.S. housing market. With the rise of online services offering fake documentation and fraudulent verification, lenders face increasing difficulty in validating borrower income and intent. These schemes—often masked as novelty tools—can lead to occupancy misrepresentation, inflated credit profiles, and fraudulent loan approvals. As lenders like Fannie Mae and Freddie Mac tighten their risk protocols, the consequences of undetected fraud ripple through financial institutions, investors, and ultimately taxpayers. At the same time, mortgage defaults continue to trigger complex legal processes, including foreclosures, which carry steep financial and reputational costs. In this project, we apply machine learning models to assess default risk and explore how data-driven strategies can help lenders better detect anomalies, protect against fraudulent applications, and reduce financial losses across varying interest rate scenarios.
from IPython import display
display.Image("Mortgage.png")
[This image was generated by AI]
Project Goal: Our project aims to develop predictive models that identify high-risk mortgage loan applications by detecting potential defaults and fraudulent behavior. By leveraging machine learning and profit-based evaluation across different interest rate scenarios, we seek to help lenders make more informed, data-driven decisions that reduce financial exposure and enhance loan portfolio quality.
# Before we get started, let's load all the packages we are going to use in this project.
# data
import pandas as pd
import numpy as np
# visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#import missingno as msno
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
#from wordcloud import WordCloud
from sklearn import datasets
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# styling
sns.set_style('darkgrid')
mpl.rcParams['font.size'] = 12
mpl.rcParams['figure.facecolor'] = '#00000000'
import os
import warnings
warnings.filterwarnings("ignore")
Section 1 Anomaly Detection through Feature Engineering ¶
df = pd.read_csv('XYZloan_default_llm.csv')
df.head()
| Unnamed: 0.1 | Unnamed: 0 | AP001 | AP002 | AP003 | AP006 | AP007 | AP008 | CR004 | CR009 | ... | TD005 | TD006 | TD009 | TD010 | TD013 | TD014 | TD022 | TD024 | loan_default | reason | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 76031 | 33 | 1 | 3 | h5 | 4 | 3 | 4 | 63100 | ... | 4 | 1 | 4 | 1 | 4 | 1 | 10.0 | 0.0 | 1 | I’d really appreciate if we could move faster ... |
| 1 | 5 | 23312 | 34 | 1 | 3 | h5 | 5 | 5 | 3 | 53370 | ... | 3 | 1 | 6 | 2 | 7 | 2 | 15.0 | 10.0 | 1 | We’re trying to align the closing date with a ... |
| 2 | 9 | 66033 | 36 | 2 | 1 | ios | 2 | 2 | 3 | 5400 | ... | 4 | 2 | 4 | 2 | 5 | 2 | 25.0 | 0.0 | 1 | It would really help to close this week so I c... |
| 3 | 10 | 41847 | 28 | 1 | 1 | ios | 5 | 5 | 3 | 2000 | ... | 4 | 4 | 7 | 4 | 7 | 4 | 25.0 | 6.0 | 1 | There are some logistics around my move that m... |
| 4 | 13 | 28275 | 35 | 2 | 4 | h5 | 3 | 3 | 4 | 27704 | ... | 4 | 1 | 4 | 1 | 7 | 1 | 25.0 | 0.0 | 1 | I’d like to close by Friday if possible—the se... |
5 rows × 32 columns
columns = ['Unnamed: 0.1', 'Unnamed: 0', 'AP001', 'AP002', 'AP003', 'AP006',
'AP007', 'AP008', 'CR004', 'CR009', 'CR015', 'CR017', 'CR018', 'CR019',
'MB005', 'MB007', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD001',
'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD022', 'TD024',
'loan_default', 'reason']
data = df[columns]
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 32 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0.1  16000 non-null  int64
 1   Unnamed: 0    16000 non-null  int64
 2   AP001         16000 non-null  int64
 3   AP002         16000 non-null  int64
 4   AP003         16000 non-null  int64
 5   AP006         16000 non-null  object
 6   AP007         16000 non-null  int64
 7   AP008         16000 non-null  int64
 8   CR004         16000 non-null  int64
 9   CR009         16000 non-null  int64
 10  CR015         16000 non-null  int64
 11  CR017         16000 non-null  int64
 12  CR018         16000 non-null  int64
 13  CR019         16000 non-null  int64
 14  MB005         15441 non-null  float64
 15  MB007         16000 non-null  object
 16  PA022         15918 non-null  float64
 17  PA023         15918 non-null  float64
 18  PA028         15918 non-null  float64
 19  PA029         15918 non-null  float64
 20  PA031         15918 non-null  float64
 21  TD001         16000 non-null  int64
 22  TD005         16000 non-null  int64
 23  TD006         16000 non-null  int64
 24  TD009         16000 non-null  int64
 25  TD010         16000 non-null  int64
 26  TD013         16000 non-null  int64
 27  TD014         16000 non-null  int64
 28  TD022         12019 non-null  float64
 29  TD024         15251 non-null  float64
 30  loan_default  16000 non-null  int64
 31  reason        16000 non-null  object
dtypes: float64(8), int64(21), object(3)
memory usage: 3.9+ MB
1.1 Target Encoding ¶
# specify categorical & numeric data type
cat_var = ['AP006', 'MB007', 'reason']
num_var = ['Unnamed: 0.1', 'Unnamed: 0', 'AP001', 'AP002', 'AP003', 'AP007',
'AP008', 'CR004', 'CR009', 'CR015', 'CR017', 'CR018', 'CR019',
'MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD001',
'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD022', 'TD024']
X_vars = cat_var + num_var
target = 'loan_default'
data[target].value_counts()
loan_default
0    12924
1     3076
Name: count, dtype: int64
1.1.1 Data Splitting ¶
X = data.drop(target, axis = 1)
y = data[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
[X_train.shape,X_test.shape, y_train.shape, y_test.shape]
[(12000, 31), (4000, 31), (12000,), (4000,)]
1.1.2 SWEETVIZ Report ¶
import sweetviz as sv
train = pd.concat([X_train, y_train], axis=1)
report = sv.analyze(train)
report.show_html()
| | [ 0%] 00:00 -> (? left)
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
report = sv.analyze(train, target_feat = target)
report.show_html('report.html')
report.show_notebook()
| | [ 0%] 00:00 -> (? left)
Report report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
From the report, we can tell that there are only a few missing values in our dataset. However, there appear to be many outliers in CR009, PA029, TD001, TD005, TD006, TD009, TD010, and TD014. These columns probably need further inspection.
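Before engineering features from the flagged columns, a quick interquartile-range check can quantify how many extreme points each one contains. The helper and toy series below are illustrative stand-ins rather than the actual dataset; in the notebook it could be applied as, e.g., `iqr_outlier_count(data['CR009'])`.

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# toy stand-in for a skewed column such as CR009
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 100])
n_out = iqr_outlier_count(toy)  # the single value 100 falls outside the fences
```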
1.1.3 Feature enhancement for the categorical variables ¶
In this section, we create a new set of mean-encoded variables for the categorical features. The encoder is fit on X_train so that each category's target mean is learned from training data only, and those learned means are then applied to the test data.
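Under the hood, plain mean encoding is just a groupby on the training data; the library's `TargetEncoder` adds smoothing on top, so its outputs differ slightly from the raw means. A minimal sketch with made-up categories (`train_cat`, `y_tr`, and `test_cat` are toy values, not the dataset):

```python
import pandas as pd

train_cat = pd.Series(['ios', 'h5', 'ios', 'android', 'h5'])
y_tr = pd.Series([1, 0, 1, 0, 1])

# fit: per-category mean of the target, computed on training data only
means = y_tr.groupby(train_cat).mean()   # ios -> 1.0, h5 -> 0.5, android -> 0.0
global_mean = y_tr.mean()                # fallback for unseen categories

# transform: map categories; unseen levels fall back to the global mean
test_cat = pd.Series(['ios', 'h5', 'web'])
encoded = test_cat.map(means).fillna(global_mean)
```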
from category_encoders import target_encoder as te
ec = te.TargetEncoder()
X_train_m = pd.DataFrame()
X_test_m = pd.DataFrame()
for var in cat_var:
X_train_m[var+'_D'] = ec.fit_transform(X_train[var],y_train)
X_test_m[var+'_D'] = ec.transform(X_test[var])
X_train_m
| AP006_D | MB007_D | reason_D | |
|---|---|---|---|
| 8928 | 0.188905 | 0.189318 | 4.365192e-05 |
| 9895 | 0.188905 | 0.189318 | 1.774992e-05 |
| 14163 | 0.220118 | 0.245247 | 0.000000e+00 |
| 14352 | 0.180544 | 0.164782 | 7.235891e-07 |
| 15916 | 0.180544 | 0.178959 | 3.017956e-16 |
| ... | ... | ... | ... |
| 13123 | 0.188905 | 0.189318 | 1.318461e-06 |
| 3264 | 0.188905 | 0.189318 | 0.000000e+00 |
| 9845 | 0.188905 | 0.189318 | 0.000000e+00 |
| 10799 | 0.180544 | 0.178959 | 4.345156e-04 |
| 2732 | 0.188905 | 0.189318 | 4.984015e-01 |
12000 rows × 3 columns
1.1.4 Feature enhancement for the continuous variables ¶
# let's inspect one of the continuous variables and cut it into 5 bins
X_train['MB005_D'] = pd.qcut(X_train['MB005'], 5)
X_train['MB005_D']
8928 (-0.001, 3.0]
9895 (4.0, 6.0]
14163 (6.0, 9.0]
14352 (9.0, 47.0]
15916 (3.0, 4.0]
...
13123 (9.0, 47.0]
3264 (6.0, 9.0]
9845 (4.0, 6.0]
10799 (6.0, 9.0]
2732 (4.0, 6.0]
Name: MB005_D, Length: 12000, dtype: category
Categories (5, interval[float64, right]): [(-0.001, 3.0] < (3.0, 4.0] < (4.0, 6.0] < (6.0, 9.0] < (9.0, 47.0]]
# according to the report, there are about 52 NAs under PA029
# flag missing values with an explicit 'NoData' category instead of dropping them
X_train['PA029_D'] = pd.qcut(X_train['PA029'], 5, duplicates='drop').cat.add_categories('NoData')
X_train['PA029_D'] = X_train['PA029_D'].fillna('NoData').astype(str)
X_train['PA029_D']
8928 (40.218, 1462.0]
9895 (-99.001, -98.0]
14163 (40.218, 1462.0]
14352 (-99.001, -98.0]
15916 (-99.001, -98.0]
...
13123 (-99.001, -98.0]
3264 (-99.001, -98.0]
9845 (-98.0, 40.218]
10799 (40.218, 1462.0]
2732 (40.218, 1462.0]
Name: PA029_D, Length: 12000, dtype: object
PA029_cutpoints = X_train['PA029'].quantile([0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0])
PA029_cutpoints
0.0     -99.000000
0.1     -98.000000
0.2     -98.000000
0.3     -98.000000
0.4     -98.000000
0.5     -98.000000
0.6     -98.000000
0.7      12.595556
0.8      40.217948
0.9      82.000000
1.0    1462.000000
Name: PA029, dtype: float64
It looks like our continuous variable, PA029, not only has several NA values but also the sentinel codes '-99' and '-98' in it.
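One way to keep such sentinels from distorting the quantile edges is to bin only the genuine measurements and give each sentinel code its own label. This is a sketch on toy values (not the actual PA029 column), shown as an alternative to letting `qcut` absorb -99/-98 into the bins:

```python
import numpy as np
import pandas as pd

# toy stand-in for PA029: sentinel codes -99/-98 plus real measurements and a NaN
s = pd.Series([-99, -98, -98, 5, 12, 40, 82, 150, np.nan])

real = s.where(~s.isin([-99, -98]))       # mask sentinels so qcut ignores them
binned = pd.qcut(real, 2).astype(str)     # bin only the genuine values
binned[s == -99] = 'Sentinel_-99'         # give each sentinel its own category
binned[s == -98] = 'Sentinel_-98'
binned[s.isna()] = 'NoData'               # keep true missingness visible too
```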
# note: PA029_D is now a string column, so the -99 and -98 sentinel values
# already sit in their own bins and need no further recoding here
X_train['PA029_D'].value_counts()
PA029_D
(-99.001, -98.0]    8128
(40.218, 1462.0]    2390
(-98.0, 40.218]     1430
NoData                52
Name: count, dtype: int64
train_bins = pd.qcut(X_train['PA029'], q=5, duplicates='drop', retbins=True)[1]
train_bins
array([ -99. , -98. , 40.2179484, 1462. ])
# apply the same training-derived bins to the test data
X_test['PA029_D'] = pd.cut(X_test['PA029'], bins=train_bins, include_lowest=True).cat.add_categories('NoData')
X_test['PA029_D'] = X_test['PA029_D'].fillna('NoData').astype('str')
# the -99/-98 sentinel values already fall into their own bins from train_bins
X_test['PA029_D']
841 (40.218, 1462.0]
8650 (-99.001, -98.0]
3701 (-98.0, 40.218]
13270 (-99.001, -98.0]
12349 (-99.001, -98.0]
...
190 (-98.0, 40.218]
4816 (-99.001, -98.0]
8671 (-99.001, -98.0]
12539 (-99.001, -98.0]
11619 (-98.0, 40.218]
Name: PA029_D, Length: 4000, dtype: object
X_test['PA029_D'].value_counts()
PA029_D
(-99.001, -98.0]    2780
(40.218, 1462.0]     776
(-98.0, 40.218]      413
NoData                31
Name: count, dtype: int64
# target encoding
ec = te.TargetEncoder(cols=['PA029_D'])
X_train_e = pd.DataFrame()
X_test_e = pd.DataFrame()
X_train_e['PA029_D'] = ec.fit_transform(X_train['PA029_D'], y_train)['PA029_D']
X_test_e['PA029_D'] = ec.transform(X_test['PA029_D'])['PA029_D']
X_train_e
| PA029_D | |
|---|---|
| 8928 | 0.260669 |
| 9895 | 0.174090 |
| 14163 | 0.260669 |
| 14352 | 0.174090 |
| 15916 | 0.174090 |
| ... | ... |
| 13123 | 0.174090 |
| 3264 | 0.174090 |
| 9845 | 0.195804 |
| 10799 | 0.260669 |
| 2732 | 0.260669 |
12000 rows × 1 columns
X_train_m = pd.concat([X_train_m, X_train_e], axis=1)
X_test_m = pd.concat([X_test_m, X_test_e], axis=1)
1.2 Imputing Missing Values ¶
X_train_numvar = X_train[num_var]
missing_columns = X_train_numvar.columns[X_train_numvar.isnull().sum() > 0]
# Display the columns with missing values
missing_columns
Index(['MB005', 'PA022', 'PA023', 'PA028', 'PA029', 'PA031', 'TD022', 'TD024'], dtype='object')
for col in missing_columns:
    mean_value = X_train[col].mean()
    # impute with the training mean in both X_train and X_test to avoid leakage
    # (plain assignment instead of chained inplace fillna, which can silently
    # operate on a copy in newer pandas)
    X_train[col] = X_train[col].fillna(mean_value)
    X_test[col] = X_test[col].fillna(mean_value)
Section 2 Model Building ¶
The target-encoded categorical columns are in X_train_m, while the numerical columns in X_train[num_var] have had their missing values filled with training means. To prepare the data for the following analysis, we combine X_train_m and X_train[num_var] (and similarly for the test data).
# modeling data
X_train_model = pd.concat([X_train_m, X_train[num_var]], axis=1)
X_train_model.shape
X_test_model = pd.concat([X_test_m, X_test[num_var]], axis=1)
[X_train_model.shape, X_test_model.shape]
[(12000, 32), (4000, 32)]
In our case, there are 12,924 non-defaults (label 0) and 3,076 defaults (label 1), so 0s are the majority class and 1s the minority. In this section we try both resampling techniques: over-sampling and under-sampling.
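For reference before running SMOTE below: imblearn samplers accept a `sampling_strategy` dict whose values are the desired per-class counts after resampling. A rough, library-free sketch of that contract using random resampling (SMOTE itself synthesizes new minority points by interpolation rather than duplicating rows; the counts here are toy values, not the dataset's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series([0] * 90 + [1] * 10)   # imbalanced toy labels

# dict of target counts per class, analogous to SMOTE's sampling_strategy
target_counts = {0: 90, 1: 30}
parts = []
for cls, n in target_counts.items():
    idx = y[y == cls].index
    # sample with replacement only when we need more rows than exist
    take = rng.choice(idx, size=n, replace=n > len(idx))
    parts.append(pd.Series([cls] * n, index=take))
y_resampled = pd.concat(parts)
```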
from sklearn.decomposition import PCA
from collections import Counter
import pylab as pl
from sklearn.preprocessing import StandardScaler
X_train_model = X_train_model.loc[y_train.index]
X_train_model = X_train_model.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
def plot_this(X_rs, y_rs, method):
X_scaled = StandardScaler().fit_transform(X_rs) # scale to mean 0, std 1
p2 = PCA(n_components=2).fit_transform(X_scaled)
y_rs = y_rs.to_numpy()
pl.scatter(p2[y_rs == 0, 0], p2[y_rs == 0, 1], c='lightblue', s=10, label='Class 0')
pl.scatter(p2[y_rs == 1, 0], p2[y_rs == 1, 1], c='orange', s=10, label='Class 1')
pl.legend()
pl.title(method)
pl.axis([-4, 5, -4, 4])
pl.show()
plot_this(X_train_model,y_train,'Original')
2.1 Over-Sampling ¶
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy={1: 3076, 0: 12924})
X_train_o, y_train_o = smote.fit_resample(X_train_model, y_train)
print('SMOTE {}'.format(Counter(y_train_o)))
plot_this(X_train_o, y_train_o,'SMOTE')
SMOTE Counter({0: 12924, 1: 3076})
2.1.1 H2O Modeling ¶
# H2O initialization
import h2o
from h2o.automl import H2OAutoML
h2o.init()
train_model_o=pd.concat([X_train_o, y_train_o],axis=1)
test_model_o=pd.concat([X_test_model,y_test],axis=1)
# Convert to H2O Frame
train_hex_o = h2o.H2OFrame(train_model_o)
test_hex_o = h2o.H2OFrame(test_model_o)
train_hex_o[target] = train_hex_o[target].asfactor()
test_hex_o[target] = test_hex_o[target].asfactor()
Checking whether there is an H2O instance running at http://localhost:54321. connected. Warning: Your H2O cluster version is (4 months and 12 days) old. There may be a newer version available. Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
| H2O_cluster_uptime: | 5 days 3 hours 22 mins |
| H2O_cluster_timezone: | America/New_York |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.46.0.7 |
| H2O_cluster_version_age: | 4 months and 12 days |
| H2O_cluster_name: | H2O_from_python_mxx_h398lp |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 2.982 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.11.4 final |
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
# Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y=target, training_frame=train_hex_o)
# View leaderboard
lb = aml.leaderboard
print(lb)
# Predict
preds = aml.leader.predict(test_hex_o)
AutoML progress: |
21:44:30.925: AutoML: XGBoost is not available; skipping it.
███████████████████████████████████████████████████████████████| (done) 100%
model_id auc logloss aucpr mean_per_class_error rmse mse
GBM_grid_1_AutoML_5_20250808_214430_model_1 1 3.55804e-06 1 0 2.9717e-05 8.83099e-10
GBM_5_AutoML_5_20250808_214430 1 2.88658e-18 1 0 2.5317e-17 6.40949e-34
StackedEnsemble_BestOfFamily_1_AutoML_5_20250808_214430 1 6.38125e-05 1 0 0.000292585 8.56059e-08
StackedEnsemble_AllModels_1_AutoML_5_20250808_214430 1 4.06801e-05 1 0 5.16192e-05 2.66455e-09
GBM_3_AutoML_5_20250808_214430 1 1.45717e-17 1 0 5.91656e-17 3.50057e-33
GBM_2_AutoML_5_20250808_214430 1 6.245e-18 1 0 3.77313e-17 1.42365e-33
GLM_1_AutoML_5_20250808_214430 1 0.000652354 1 0 0.00428889 1.83946e-05
GBM_grid_1_AutoML_5_20250808_214430_model_3 1 2.49221e-13 1 0 2.10333e-11 4.424e-22
DeepLearning_grid_1_AutoML_5_20250808_214430_model_1 1 0.00192025 1 0 0.0253579 0.000643021
GBM_4_AutoML_5_20250808_214430 1 5.76761e-17 1 0 1.97638e-16 3.90609e-32
[22 rows x 7 columns]
███████████████████████████████████████████████████████| (done) 100%
aml.leaderboard.head(20)
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GBM_grid_1_AutoML_5_20250808_214430_model_1 | 1 | 3.55804e-06 | 1 | 0 | 2.9717e-05 | 8.83099e-10 |
| GBM_5_AutoML_5_20250808_214430 | 1 | 2.88658e-18 | 1 | 0 | 2.5317e-17 | 6.40949e-34 |
| StackedEnsemble_BestOfFamily_1_AutoML_5_20250808_214430 | 1 | 6.38125e-05 | 1 | 0 | 0.000292585 | 8.56059e-08 |
| StackedEnsemble_AllModels_1_AutoML_5_20250808_214430 | 1 | 4.06801e-05 | 1 | 0 | 5.16192e-05 | 2.66455e-09 |
| GBM_3_AutoML_5_20250808_214430 | 1 | 1.45717e-17 | 1 | 0 | 5.91656e-17 | 3.50057e-33 |
| GBM_2_AutoML_5_20250808_214430 | 1 | 6.245e-18 | 1 | 0 | 3.77313e-17 | 1.42365e-33 |
| GLM_1_AutoML_5_20250808_214430 | 1 | 0.000652354 | 1 | 0 | 0.00428889 | 1.83946e-05 |
| GBM_grid_1_AutoML_5_20250808_214430_model_3 | 1 | 2.49221e-13 | 1 | 0 | 2.10333e-11 | 4.424e-22 |
| DeepLearning_grid_1_AutoML_5_20250808_214430_model_1 | 1 | 0.00192025 | 1 | 0 | 0.0253579 | 0.000643021 |
| GBM_4_AutoML_5_20250808_214430 | 1 | 5.76761e-17 | 1 | 0 | 1.97638e-16 | 3.90609e-32 |
| DRF_1_AutoML_5_20250808_214430 | 1 | 0.0321971 | 1 | 0 | 0.0530536 | 0.00281468 |
| XRT_1_AutoML_5_20250808_214430 | 1 | 0.0292763 | 1 | 0 | 0.0454691 | 0.00206744 |
| GBM_grid_1_AutoML_5_20250808_214430_model_5 | 1 | 0 | 1 | 0 | 0 | 0 |
| GBM_grid_1_AutoML_5_20250808_214430_model_4 | 1 | 3.8138e-14 | 1 | 0 | 3.68861e-13 | 1.36058e-25 |
| GBM_1_AutoML_5_20250808_214430 | 1 | 1.6237e-18 | 1 | 0 | 1.97825e-17 | 3.91349e-34 |
| GBM_grid_1_AutoML_5_20250808_214430_model_2 | 1 | 4.16334e-20 | 1 | 0 | 3.04047e-18 | 9.24446e-36 |
| DeepLearning_grid_2_AutoML_5_20250808_214430_model_1 | 1 | 0.00203423 | 1 | 3.86877e-05 | 0.0230191 | 0.000529877 |
| DeepLearning_grid_3_AutoML_5_20250808_214430_model_2 | 1 | 0.000416729 | 1 | 0.000162549 | 0.00975301 | 9.51213e-05 |
| DeepLearning_1_AutoML_5_20250808_214430 | 0.999999 | 0.00109805 | 0.999997 | 0.000278612 | 0.0155298 | 0.000241175 |
| DeepLearning_grid_1_AutoML_5_20250808_214430_model_2 | 0.999998 | 0.00158524 | 0.999993 | 0.000642397 | 0.0222053 | 0.000493076 |
[20 rows x 7 columns]
2.1.2 Top Model Selection ¶
We select GBM_grid_1_AutoML_5_20250808_214430_model_1 because it is the highest-ranked individual model: unlike the two stacked ensembles above it, it meets our requirement of using a single model. Its leaderboard metrics are essentially perfect (AUC of 1.0, zero mean per-class error, near-zero logloss and RMSE), which on a resampled training set is as much a warning sign of overfitting or leakage as of skill, so the held-out test evaluation below is the meaningful check.
model_id = 'GBM_grid_1_AutoML_5_20250808_214430_model_1'
best_model_o = h2o.get_model(model_id)
best_model_o.summary()
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 454.0 | 454.0 | 3247052.0 | 16.0 | 16.0 | 15.334802 | 1.0 | 744.0 | 563.20264 |
# Predict on test data
preds = aml.leader.predict(test_hex_o)
# Get performance metrics on test set
perf = aml.leader.model_performance(test_hex_o)
# Print overall error metrics (H2O reports these even for classifiers)
print("RMSE:", perf.rmse())
#print("MAE :", perf.mae())
print("R2 :", perf.r2())
print("MSE :", perf.mse())
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
RMSE: 0.04743416500955298
R2 : 0.9851698300946758
MSE : 0.0022500000101535
# Get predictions (returns an H2OFrame with class and probs)
preds = best_model_o.predict(test_hex_o).as_data_frame()
# View the frame
preds.head()
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
/Users/mxx/miniconda3/lib/python3.11/site-packages/h2o/frame.py:1983: H2ODependencyWarning: Converting H2O frame to pandas dataframe using single-thread. For faster conversion using multi-thread, install polars and pyarrow and use it as pandas_df = h2o_df.as_data_frame(use_multi_thread=True)
| predict | p0 | p1 | |
|---|---|---|---|
| 0 | 0 | 0.000002 | 9.999984e-01 |
| 1 | 0 | 1.000000 | 2.767149e-09 |
| 2 | 0 | 1.000000 | 2.456172e-11 |
| 3 | 0 | 1.000000 | 3.648818e-11 |
| 4 | 0 | 1.000000 | 1.405225e-09 |
best_model_o_probs = preds['p1']
best_model_o_pred = preds['predict']
2.1.3 Model Performance ¶
from sklearn.metrics import (roc_curve, roc_auc_score, accuracy_score, confusion_matrix, auc,
average_precision_score, precision_recall_curve, f1_score)
from IPython.display import display
# metrics table
def make_core_metrics_tables(y_true, y_pred, y_prob):
tblA = pd.DataFrame({
"metric": ["Accuracy", "ROC AUC", "Average Precision (AUPRC)", "F1 score"],
"value": [
accuracy_score(y_true, y_pred),
roc_auc_score(y_true, y_prob),
average_precision_score(y_true, y_prob),
f1_score(y_true, y_pred),
],
})
cm_tbl = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=[0,1]),
index=["Actual 0","Actual 1"], columns=["Pred 0","Pred 1"]
)
return tblA, cm_tbl
# ROC-AUC
def plot_roc(y_true,y_prob):
fpr, tpr, _ = roc_curve(y_true, y_prob)
roc_auc_value = roc_auc_score(y_true,y_prob)
lw=2
plt.figure(figsize=(6,4))
plt.plot(fpr,tpr, color='darkorange',lw=lw,label='ROC curve (area = %0.2f)' %roc_auc_value)
plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc='lower right')
plt.show()
# precision-recall
def ROC_PR(y_actual, y_pred):
# ROC
fpr = list()
tpr = list()
roc_auc = list()
fpr,tpr,_ = roc_curve(y_actual,y_pred)
roc_auc = auc(fpr,tpr)
# Precision-Recall
average_precision = average_precision_score(y_actual,y_pred)
print('')
    print(' * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
print('')
print(' * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
print('')
print(' * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
print('')
# plotting
plt.figure(figsize=(10,4))
# ROC
plt.subplot(1,2,1)
    plt.plot(fpr,tpr,color='darkorange',lw=2,label='ROC curve (area=%0.2f)' % roc_auc)
plt.plot([0,1],[0,1],color='navy',lw=3,linestyle='--')
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic: AUC={0:0.3f}'.format(roc_auc))
plt.legend(loc='lower right')
# Precision-Recall
plt.subplot(1,2,2)
precision,recall,_ = precision_recall_curve(y_actual,y_pred)
plt.step(recall,precision,color='b',alpha=0.2,where='post')
plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0,1.05])
plt.xlim([0.0,1.0])
plt.title('Precision-Recall curve: PR={0:0.3f}'.format(average_precision))
plt.show()
# gains table
def gains_table_profit(Y_test,y_pred):
df_prep = pd.DataFrame(columns = ['actual','pred'])
df_prep['actual'] = Y_test
df_prep['pred'] =y_pred
df_prep = df_prep.sort_values(by='pred',ascending=False)
df_prep['row_id'] = range(0,0+len(df_prep))
df_prep.head()
df_prep['decile'] = (df_prep['row_id'] / (len(df_prep)/10)).astype(int)
    df_prep.loc[df_prep['decile'] == 10, 'decile'] = 9  # clamp the edge case into the last decile
df_prep['decile'].value_counts()
# Create gains table
gains = df_prep.groupby('decile')['actual'].agg(['count','sum'])
gains.columns = ['count','actual']
gains
gains['non_actual'] = gains['count'] - gains['actual']
gains['cum_count'] = gains['count'].cumsum()
gains['cum_actual'] = gains['actual'].cumsum()
gains['cum_non_actual'] = gains['non_actual'].cumsum()
gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
gains['if_random'] = np.max(gains['cum_actual']) /10
gains['if_random'] = gains['if_random'].cumsum()
gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
gains['K_S'] = np.abs( gains['percent_cum_actual'] - gains['percent_cum_non_actual'] ) * 100
gains['gain'] = (gains['cum_actual'] / gains['cum_count']*100).round(2)
    gains['revenue'] = approved_loan * irr * gains['non_actual']
    gains['cost'] = approved_loan * gains['actual']
    gains['profit'] = gains['revenue'] - gains['cost']
    return gains
# Assume the following values
approved_loan = 10000
irr = 0.10 # annual interest rate
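As a sanity check on the profit columns, these assumed economics can be worked by hand for the top decile of the gains table computed below: with 316 defaults and 84 good loans, interest income is far outweighed by lost principal (the assumption, as in the cost formula, is that a default loses the entire principal).

```python
# decile-level profit under the stated assumptions:
# each approved $10,000 loan earns 10% interest if repaid,
# and loses the full principal if the borrower defaults
approved_loan = 10_000
irr = 0.10

non_defaults = 84    # decile-0 counts from the gains table
defaults = 316

revenue = approved_loan * irr * non_defaults   # ~84,000 in interest
cost = approved_loan * defaults                # 3,160,000 in lost principal
profit = revenue - cost                        # ~-3,076,000
```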
tblo, cm_tbl_o = make_core_metrics_tables(y_test, best_model_o_pred, best_model_o_probs)
print("Table A — Core sklearn metrics (Over-Sampling)"); display(tblo)
print("\nConfusion Matrix (Over-Sampling)"); display(cm_tbl_o)
Table A — Core sklearn metrics (Over-Sampling)
| metric | value | |
|---|---|---|
| 0 | Accuracy | 0.813500 |
| 1 | ROC AUC | 0.994974 |
| 2 | Average Precision (AUPRC) | 0.993234 |
| 3 | F1 score | 0.000000 |
Confusion Matrix (Over-Sampling)
| Pred 0 | Pred 1 | |
|---|---|---|
| Actual 0 | 3254 | 0 |
| Actual 1 | 746 | 0 |
Although the model shows a very high ROC AUC (0.995) and average precision (0.993), its F1 score is 0.0: the confusion matrix shows it labels every case as class 0, missing all 746 positives. The probabilities rank defaults almost perfectly, but the hard-label threshold H2O learned on the resampled training data does not transfer to the test set, so the minority class is never predicted.
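The fix is not a new model but a new cutoff: the p1 probabilities already separate the classes, so re-deriving a threshold from the precision-recall curve restores the minority predictions. A minimal sketch with toy probabilities (ours, not the model's) that reproduce the same failure mode of well-ranked positives all sitting below the default cutoff:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

# toy scores: positives rank above negatives, yet all sit below 0.5,
# so default hard labels are all 0 and F1 collapses to 0
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.01, 0.02, 0.05, 0.10, 0.30, 0.35, 0.40, 0.45])

# re-derive the cutoff that maximizes F1 from the precision-recall curve
prec, rec, thr = precision_recall_curve(y_true, y_prob)
f1 = 2 * prec[:-1] * rec[:-1] / np.maximum(prec[:-1] + rec[:-1], 1e-12)
best_threshold = thr[np.argmax(f1)]
y_pred_new = (y_prob >= best_threshold).astype(int)
```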
plot_roc(y_test, best_model_o_probs)
ROC_PR(y_test, best_model_o_probs)
 * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate

 * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

 * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)
The ROC curve shows an AUC of 0.995, indicating near-perfect discrimination between positive and negative classes with almost no trade-off between sensitivity and specificity. The Precision-Recall curve, with an area of 0.993, demonstrates that the model maintains extremely high precision across all recall levels, meaning it can correctly identify the vast majority of positive cases while keeping false positives minimal.
# gains table
best_gains_o = gains_table_profit(y_test,best_model_o_probs)
best_gains_o
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 84000.0 | 3160000 | -3076000.0 |
| 1 | 400 | 306 | 94 | 800 | 622 | 178 | 0.83 | 0.05 | 149.2 | 4.17 | 78.0 | 77.75 | 94000.0 | 3060000 | -2966000.0 |
| 2 | 400 | 124 | 276 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 276000.0 | 1240000 | -964000.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 400000.0 | 0 | 400000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 400000.0 | 0 | 400000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 400000.0 | 0 | 400000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 400000.0 | 0 | 400000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 400000.0 | 0 | 400000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 400000.0 | 0 | 400000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 400000.0 | 0 | 400000.0 |
# K-S Statistic
best_gains_o[['percent_cum_actual','percent_cum_non_actual']].plot.line()
plt.show()
This gains chart shows that the model captures 100% of the actual positives (blue line) by the 3rd decile, indicating very strong ranking performance. In contrast, the cumulative percentage of non-actuals (orange line) grows much more gradually, meaning the model effectively prioritizes true positives early while minimizing false positives in the top-ranked deciles.
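The K-S statistic visualized here is simply the maximum vertical gap between the two cumulative curves. Recomputing it from the decile values in the gains table above:

```python
import numpy as np

# cumulative capture rates per decile, taken from the gains table
pct_cum_actual = np.array([0.42, 0.83, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00])
pct_cum_non_actual = np.array([0.03, 0.05, 0.14, 0.26, 0.39, 0.51, 0.63, 0.75, 0.88, 1.00])

# K-S = largest gap between the two curves, in percentage points
ks = np.max(np.abs(pct_cum_actual - pct_cum_non_actual)) * 100  # peaks at decile 2
```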
# cumulative lift
best_gains_o['lift'].plot.line()
plt.show()
This lift chart shows that the model achieves a lift of over 4 in the first decile, meaning it identifies more than four times the positive cases compared to random selection. Lift decreases steadily across subsequent deciles, indicating that the model ranks the most relevant cases early, but predictive power diminishes as it moves toward lower-ranked segments.
2.2 Under-Sampling ¶
from imblearn.under_sampling import NearMiss
# use a distinct instance name so we don't shadow the NearMiss class
nm = NearMiss()
X_train_u, y_train_u = nm.fit_resample(X_train_model, y_train)
print('NearMiss {}'.format(Counter(y_train_u)))
plot_this(X_train_u, y_train_u, 'NearMiss')
NearMiss Counter({0: 2330, 1: 2330})
2.2.1 H2O Modeling ¶
# H2O initialization
h2o.init()
train_model_u=pd.concat([X_train_u, y_train_u],axis=1)
test_model_u=pd.concat([X_test_model,y_test],axis=1)
# Convert to H2O Frame
train_hex_u = h2o.H2OFrame(train_model_u)
test_hex_u = h2o.H2OFrame(test_model_u)
train_hex_u[target] = train_hex_u[target].asfactor()
test_hex_u[target] = test_hex_u[target].asfactor()
# Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y=target, training_frame=train_hex_u)
# View leaderboard
lb = aml.leaderboard
print(lb)
# Predict
preds = aml.leader.predict(test_hex_u)
Checking whether there is an H2O instance running at http://localhost:54321. connected. Warning: Your H2O cluster version is (4 months and 12 days) old. There may be a newer version available. Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html
| H2O_cluster_uptime: | 5 days 4 hours 28 mins |
| H2O_cluster_timezone: | America/New_York |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.46.0.7 |
| H2O_cluster_version_age: | 4 months and 12 days |
| H2O_cluster_name: | H2O_from_python_mxx_h398lp |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 2.615 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |
| H2O_cluster_status: | locked, healthy |
| H2O_connection_url: | http://localhost:54321 |
| H2O_connection_proxy: | {"http": null, "https": null} |
| H2O_internal_security: | False |
| Python_version: | 3.11.4 final |
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
22:38:13.62: AutoML: XGBoost is not available; skipping it.
███████████████████████████████████████████████████████████████| (done) 100%
model_id                                                  auc      logloss        aucpr    mean_per_class_error    rmse           mse
GBM_3_AutoML_7_20250808_223813                            1        2.60402e-17    1        0                       8.87076e-17    7.86904e-33
DRF_1_AutoML_7_20250808_223813                            1        0.0595403      1        0                       0.0777526      0.00604547
GBM_grid_1_AutoML_7_20250808_223813_model_4               1        1.5641e-14     1        0                       4.40705e-13    1.94221e-25
GBM_1_AutoML_7_20250808_223813                            1        1.97744e-18    1        0                       2.23588e-17    4.99915e-34
GBM_grid_1_AutoML_7_20250808_223813_model_5               1        0              1        0                       0              0
GLM_1_AutoML_7_20250808_223813                            1        0.00099787     1        0                       0.00198565     3.94281e-06
GBM_grid_1_AutoML_7_20250808_223813_model_1               1        5.92014e-06    1        0                       0.000127262    1.61957e-08
GBM_grid_1_AutoML_7_20250808_223813_model_3               1        2.38148e-09    1        0                       8.84254e-08    7.81906e-15
DeepLearning_grid_3_AutoML_7_20250808_223813_model_2      1        0.066989       1        0                       0.130773       0.0171016
GBM_5_AutoML_7_20250808_223813                            1        7.7668e-18     1        0                       4.22854e-17    1.78806e-33
[22 rows x 7 columns]
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
aml.leaderboard.head(20)
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GBM_3_AutoML_7_20250808_223813 | 1 | 2.60402e-17 | 1 | 0 | 8.87076e-17 | 7.86904e-33 |
| DRF_1_AutoML_7_20250808_223813 | 1 | 0.0595403 | 1 | 0 | 0.0777526 | 0.00604547 |
| GBM_grid_1_AutoML_7_20250808_223813_model_4 | 1 | 1.5641e-14 | 1 | 0 | 4.40705e-13 | 1.94221e-25 |
| GBM_1_AutoML_7_20250808_223813 | 1 | 1.97744e-18 | 1 | 0 | 2.23588e-17 | 4.99915e-34 |
| GBM_grid_1_AutoML_7_20250808_223813_model_5 | 1 | 0 | 1 | 0 | 0 | 0 |
| GLM_1_AutoML_7_20250808_223813 | 1 | 0.00099787 | 1 | 0 | 0.00198565 | 3.94281e-06 |
| GBM_grid_1_AutoML_7_20250808_223813_model_1 | 1 | 5.92014e-06 | 1 | 0 | 0.000127262 | 1.61957e-08 |
| GBM_grid_1_AutoML_7_20250808_223813_model_3 | 1 | 2.38148e-09 | 1 | 0 | 8.84254e-08 | 7.81906e-15 |
| DeepLearning_grid_3_AutoML_7_20250808_223813_model_2 | 1 | 0.066989 | 1 | 0 | 0.130773 | 0.0171016 |
| GBM_5_AutoML_7_20250808_223813 | 1 | 7.7668e-18 | 1 | 0 | 4.22854e-17 | 1.78806e-33 |
| GBM_4_AutoML_7_20250808_223813 | 1 | 1.19837e-16 | 1 | 0 | 4.31596e-16 | 1.86275e-31 |
| GBM_2_AutoML_7_20250808_223813 | 1 | 1.25079e-17 | 1 | 0 | 5.69227e-17 | 3.24019e-33 |
| GBM_grid_1_AutoML_7_20250808_223813_model_2 | 1 | 4.76491e-20 | 1 | 0 | 3.25272e-18 | 1.05802e-35 |
| StackedEnsemble_AllModels_1_AutoML_7_20250808_223813 | 1 | 6.54544e-05 | 1 | 0 | 6.55141e-05 | 4.29209e-09 |
| StackedEnsemble_BestOfFamily_1_AutoML_7_20250808_223813 | 1 | 8.78055e-05 | 1 | 0 | 0.000101434 | 1.02888e-08 |
| XRT_1_AutoML_7_20250808_223813 | 1 | 0.0574092 | 1 | 0 | 0.0716859 | 0.00513887 |
| DeepLearning_grid_2_AutoML_7_20250808_223813_model_2 | 0.999999 | 0.0469518 | 0.999999 | 0.000214592 | 0.114544 | 0.0131202 |
| DeepLearning_1_AutoML_7_20250808_223813 | 0.999998 | 0.0019878 | 0.999998 | 0.000643777 | 0.0217402 | 0.000472635 |
| DeepLearning_grid_1_AutoML_7_20250808_223813_model_1 | 0.999991 | 0.00683069 | 0.999991 | 0.000643777 | 0.0373196 | 0.00139275 |
| DeepLearning_grid_3_AutoML_7_20250808_223813_model_1 | 0.999987 | 0.0114072 | 0.999987 | 0.00128755 | 0.0559844 | 0.00313425 |
[20 rows x 7 columns]
2.2.2 Top Model Selection ¶
The best choice here is GBM_grid_1_AutoML_7_20250808_223813_model_5: it ties the other leaders on AUC, AUCPR, and mean per-class error, while also achieving the lowest possible logloss, RMSE, and MSE (all exactly zero). A literal zero logloss is as much a warning sign as a strength, since it usually indicates label leakage or severe overfitting, so this perfect fit should be verified on truly unseen data before deployment.
model_id = 'GBM_grid_1_AutoML_7_20250808_223813_model_5'
best_model_u = h2o.get_model(model_id)
best_model_u.summary()
| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves | |
|---|---|---|---|---|---|---|---|---|---|
| 395.0 | 395.0 | 33439.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 |
# Predict on test data with the AutoML leader (leaderboard rank 1; note this may
# differ from best_model_u selected above)
preds = aml.leader.predict(test_hex_u)
# Get performance metrics on the test set
perf = aml.leader.model_performance(test_hex_u)
# Print overall error metrics
print("RMSE:", perf.rmse())
#print("MAE :", perf.mae())
print("R2 :", perf.r2())
print("MSE :", perf.mse())
gbm prediction progress: (done) 100%
RMSE: 0.03872983346207417
R2 : 0.990113220107733
MSE : 0.0015
# Get predictions (returns an H2OFrame with class and probs)
preds = best_model_u.predict(test_hex_u).as_data_frame()
# View the frame
preds.head()
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
| | predict | p0 | p1 |
|---|---|---|---|
| 0 | 1 | 0 | 1.000000e+00 |
| 1 | 0 | 1 | 3.379019e-18 |
| 2 | 0 | 1 | 3.379019e-18 |
| 3 | 0 | 1 | 3.379019e-18 |
| 4 | 0 | 1 | 3.379019e-18 |
best_model_u_probs = preds['p1']
best_model_u_pred = preds['predict']
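The `predict` column above is just a thresholded version of `p1`. H2O labels binomial predictions with its F1-optimal threshold rather than a fixed 0.5, but with probabilities this extreme a plain 0.5 cutoff reproduces the same labels. A toy re-creation of the first rows (values copied from the output; `preds` here is an illustrative stand-in, not the real H2O frame):

```python
import pandas as pd

# Toy re-creation of the prediction frame above (values copied from the output)
preds = pd.DataFrame({
    "p0": [0.0, 1.0, 1.0],
    "p1": [1.0, 3.379019e-18, 3.379019e-18],
})
# With near-0/1 probabilities, a plain 0.5 cutoff matches H2O's labels
preds["predict"] = (preds["p1"] > 0.5).astype(int)
print(preds["predict"].tolist())  # [1, 0, 0]
```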
2.2.3 Model Performance ¶
tblu, cm_tbl_u = make_core_metrics_tables(y_test, best_model_u_pred, best_model_u_probs)
print("Table B — Core sklearn metrics (Under-Sampling)"); display(tblu)
print("\nConfusion Matrix (Under-Sampling)"); display(cm_tbl_u)
Table B — Core sklearn metrics (Under-Sampling)
| metric | value | |
|---|---|---|
| 0 | Accuracy | 0.998500 |
| 1 | ROC AUC | 0.999078 |
| 2 | Average Precision (AUPRC) | 0.992021 |
| 3 | F1 score | 0.995995 |
Confusion Matrix (Under-Sampling)
| Pred 0 | Pred 1 | |
|---|---|---|
| Actual 0 | 3248 | 6 |
| Actual 1 | 0 | 746 |
The under-sampling model delivers exceptional performance, achieving 99.85% accuracy, a ROC AUC of 0.9991, and a high F1 score of 0.9960. The confusion matrix shows it correctly classified all positive cases and misclassified only 6 out of 3,254 negative cases, indicating near-perfect discrimination between classes.
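As a sanity check, the headline metrics can be re-derived by hand from the confusion matrix counts above (a quick illustrative calculation, not part of the modeling pipeline):

```python
# Counts taken from the confusion matrix above
tn, fp = 3248, 6    # actual 0: predicted 0 / predicted 1
fn, tp = 0, 746     # actual 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total                        # (746 + 3248) / 4000
precision = tp / (tp + fp)                          # 746 / 752
recall = tp / (tp + fn)                             # 746 / 746 = 1.0
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}  F1: {f1:.6f}")    # Accuracy: 0.9985  F1: 0.995995
```

Both numbers match Table B, confirming the sklearn metrics are consistent with the raw counts.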
plot_roc(y_test, best_model_u_probs)
ROC_PR(y_test, best_model_u_probs)
- ROC curve: plots the true positive rate vs. the false positive rate
- The area under the curve (AUC): a value between 0.5 (random) and 1.0 (perfect), measuring how well the model ranks positives above negatives
- Recall (R) = true positives / (true positives + false negatives)
These curves show that the model delivers almost perfect classification performance. The ROC curve has an AUC of 0.999, indicating near-ideal separation between positive and negative classes, with a true positive rate close to 1 across nearly all false positive rates. The Precision-Recall curve also demonstrates exceptional results, with precision and recall both near 1, confirming that the model maintains high accuracy in identifying positives even in imbalanced data scenarios.
# gains table
best_gains_u = gains_table_profit(y_test,best_model_u_probs)
best_gains_u
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 84000.0 | 3160000 | -3076000.0 |
| 1 | 400 | 304 | 96 | 800 | 620 | 180 | 0.83 | 0.06 | 149.2 | 4.16 | 77.0 | 77.50 | 96000.0 | 3040000 | -2944000.0 |
| 2 | 400 | 126 | 274 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 274000.0 | 1260000 | -986000.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 400000.0 | 0 | 400000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 400000.0 | 0 | 400000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 400000.0 | 0 | 400000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 400000.0 | 0 | 400000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 400000.0 | 0 | 400000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 400000.0 | 0 | 400000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 400000.0 | 0 | 400000.0 |
# K-S Statistic
best_gains_u[['percent_cum_actual','percent_cum_non_actual']].plot.line()
plt.show()
This cumulative gains chart indicates that the model effectively ranks the actual positives at the top deciles. By the 3rd decile, 100% of actual positives have been captured, while the percentage of non-actuals (false positives) remains relatively low until later deciles. The steep rise of the blue line compared to the orange line demonstrates strong discriminative ability, with the model successfully prioritizing true positives ahead of non-positives.
# cumulative lift
best_gains_u['lift'].plot.line()
plt.show()
This lift chart shows that the model achieves its highest lift of over 4 in the first decile, meaning it identifies more than four times as many positives compared to random targeting. The lift gradually decreases across subsequent deciles, approaching 1 in the last decile, which indicates no improvement over random selection. The steep decline suggests that the model is highly effective at ranking positives at the top but offers diminishing returns as you move to lower-ranked segments.
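The K-S and lift columns plotted above can be reproduced from the decile counts alone. This minimal sketch hand-builds a frame from the under-sampling gains table's numbers (it is not the output of the actual `gains_table_profit` helper, and the 2-dp rounding on the K-S inputs is an assumption made to match the table):

```python
import pandas as pd

# Per-decile counts copied from the under-sampling gains table
df = pd.DataFrame({"count": [400] * 10,
                   "actual": [316, 304, 126, 0, 0, 0, 0, 0, 0, 0]})
df["non_actual"] = df["count"] - df["actual"]
total_actual = df["actual"].sum()        # 746 positives
total_non = df["non_actual"].sum()       # 3254 negatives

cum_actual = df["actual"].cumsum() / total_actual
cum_non = df["non_actual"].cumsum() / total_non
# K-S: separation between the two cumulative distributions (the table appears
# to compute it from the 2-dp rounded cumulative percentages)
df["K_S"] = (cum_actual.round(2) - cum_non.round(2)) * 100
# Lift: share of positives captured vs. share of population contacted
df["lift"] = cum_actual / (df["count"].cumsum() / df["count"].sum())

print(df[["K_S", "lift"]].round(2))  # max K-S 86 at decile 2; lift 4.24 at decile 0
```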
2.3 Model Comparison ¶
The under-sampling approach clearly outperforms over-sampling in this case. Under-sampling achieved near-perfect accuracy (99.85%) and an F1 score of 0.996, with very few false positives and zero false negatives, indicating excellent balance between precision and recall. In contrast, over-sampling produced a much lower accuracy (81.35%) and an F1 score of 0, failing to correctly classify any positive cases despite a high ROC AUC, suggesting severe class prediction bias.
Section 3 Profit & Loss Calculations (P&L) ¶
Assumptions
- Each approved loan has a principal value of $10,000.
- For non-defaulted loans, the bank earns interest revenue based on a given annual interest rate (e.g., 4%, 8%, 12%).
- If a loan defaults, the bank incurs a total loss of $10,000 (i.e., the full loan amount is unrecovered).
Profit Calculation Framework
- If a loan is approved and does not default (loan_default = 0): The bank earns interest revenue (e.g., $400, $800, or $1,200 depending on the interest rate).
- If a loan is approved and defaults (loan_default = 1): The bank loses the entire $10,000 principal.
- We will compute:
- Total Revenue: Sum of interest earned from loans that did not default.
- Total Loss: Sum of principal lost due to defaults.
- Net Profit: Total Revenue - Total Loss.
What We Can Do
- Compare the profit and loss (P&L) outcomes of the over-sampling and under-sampling models based on their predictions.
- Perform a sensitivity analysis by calculating the P&L for different interest rates (4%, 8%, 12%).
- Use these results to evaluate which model offers better financial outcomes, and assess how changing interest rates influence model performance and profitability.
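Under these assumptions, the per-decile P&L reduces to a simple formula. A minimal sketch (`PRINCIPAL` and `decile_profit` are illustrative names, not the notebook's actual `gains_table_profit` helper):

```python
PRINCIPAL = 10_000  # stated loan principal

def decile_profit(n_good: int, n_default: int, rate: float):
    """Return (revenue, loss, net) for one decile of approved loans."""
    revenue = n_good * PRINCIPAL * rate   # interest earned on performing loans
    loss = n_default * PRINCIPAL          # full principal lost on each default
    return revenue, loss, revenue - loss

# Example: a clean decile (400 performing loans, 0 defaults) at each rate
for rate in (0.04, 0.08, 0.12):
    rev, loss, net = decile_profit(400, 0, rate)
    print(f"rate={rate:.0%}: revenue={rev:,.0f} loss={loss:,.0f} net={net:,.0f}")
```

The nets of 160,000 / 320,000 / 480,000 per clean decile match the gains tables in the sections that follow.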
3.1 Interest Rate = 4% ¶
3.1.1 Over-Sampling ¶
irr = 0.04 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 1120000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 33600.0 | 3160000 | -3126400.0 |
| 1 | 400 | 306 | 94 | 800 | 622 | 178 | 0.83 | 0.05 | 149.2 | 4.17 | 78.0 | 77.75 | 37600.0 | 3060000 | -3022400.0 |
| 2 | 400 | 124 | 276 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 110400.0 | 1240000 | -1129600.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 160000.0 | 0 | 160000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 160000.0 | 0 | 160000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 160000.0 | 0 | 160000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 160000.0 | 0 | 160000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 160000.0 | 0 | 160000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 160000.0 | 0 | 160000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 160000.0 | 0 | 160000.0 |
3.1.2 Under-Sampling ¶
irr = 0.04 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 1120000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 33600.0 | 3160000 | -3126400.0 |
| 1 | 400 | 304 | 96 | 800 | 620 | 180 | 0.83 | 0.06 | 149.2 | 4.16 | 77.0 | 77.50 | 38400.0 | 3040000 | -3001600.0 |
| 2 | 400 | 126 | 274 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 109600.0 | 1260000 | -1150400.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 160000.0 | 0 | 160000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 160000.0 | 0 | 160000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 160000.0 | 0 | 160000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 160000.0 | 0 | 160000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 160000.0 | 0 | 160000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 160000.0 | 0 | 160000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 160000.0 | 0 | 160000.0 |
Key observations:
- Both over-sampling and under-sampling approaches achieve the same total positive profit of 1,120,000 at a 4% interest rate, but the distribution across deciles is different.
- In the over-sampling case, positive profits start appearing at decile 3, with earlier deciles (0–2) generating significant losses due to high costs from false positives.
- In the under-sampling case, positive profit also starts at decile 3; relative to over-sampling, its decile 1 loss is slightly smaller (-3,001,600 vs. -3,022,400) while its decile 2 loss is slightly larger (-1,150,400 vs. -1,129,600).
- The lift and K-S patterns are similar, but small variations could be important for threshold optimization.
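One way to read these early-decile losses: a decile only turns profitable once interest on its performing loans covers the principal lost to its defaults. For under-sampling's decile 2 (274 performing loans, 126 defaults), the break-even rate is far above any realistic pricing. A quick illustrative calculation (`break_even_rate` is a hypothetical helper, not part of the notebook):

```python
# Break-even: n_good * principal * rate == n_default * principal,
# so the principal cancels and the rate is just the default-to-good ratio.
def break_even_rate(n_good: int, n_default: int) -> float:
    return n_default / n_good

rate = break_even_rate(274, 126)
print(f"Decile 2 break-even rate: {rate:.1%}")  # ~46%, far above any quoted rate
```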
3.2 Interest Rate = 8% ¶
3.2.1 Over-Sampling ¶
irr = 0.08 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 2240000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 67200.0 | 3160000 | -3092800.0 |
| 1 | 400 | 306 | 94 | 800 | 622 | 178 | 0.83 | 0.05 | 149.2 | 4.17 | 78.0 | 77.75 | 75200.0 | 3060000 | -2984800.0 |
| 2 | 400 | 124 | 276 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 220800.0 | 1240000 | -1019200.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 320000.0 | 0 | 320000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 320000.0 | 0 | 320000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 320000.0 | 0 | 320000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 320000.0 | 0 | 320000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 320000.0 | 0 | 320000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 320000.0 | 0 | 320000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 320000.0 | 0 | 320000.0 |
3.2.2 Under-Sampling ¶
irr = 0.08 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 2240000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 67200.0 | 3160000 | -3092800.0 |
| 1 | 400 | 304 | 96 | 800 | 620 | 180 | 0.83 | 0.06 | 149.2 | 4.16 | 77.0 | 77.50 | 76800.0 | 3040000 | -2963200.0 |
| 2 | 400 | 126 | 274 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 219200.0 | 1260000 | -1040800.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 320000.0 | 0 | 320000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 320000.0 | 0 | 320000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 320000.0 | 0 | 320000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 320000.0 | 0 | 320000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 320000.0 | 0 | 320000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 320000.0 | 0 | 320000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 320000.0 | 0 | 320000.0 |
Key observations:
- Both over-sampling and under-sampling approaches achieve the same total positive profit of 2,240,000 at an 8% interest rate, with very similar profit distribution across deciles.
- In both cases, positive profits start appearing at decile 3, with deciles 0–2 producing substantial losses due to high costs from false positives.
- Loss magnitudes in the early deciles are nearly identical between the two methods, showing only minimal differences in revenue and cost patterns.
- Lift and K-S statistics follow almost identical trends for both methods, indicating comparable model ranking performance across deciles.
- Given the similar patterns, the choice between over-sampling and under-sampling at 8% IRR is unlikely to materially impact profit outcomes.
3.3 Interest Rate = 12% ¶
3.3.1 Over-Sampling ¶
irr = 0.12 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_o_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 3360000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 100800.0 | 3160000 | -3059200.0 |
| 1 | 400 | 306 | 94 | 800 | 622 | 178 | 0.83 | 0.05 | 149.2 | 4.17 | 78.0 | 77.75 | 112800.0 | 3060000 | -2947200.0 |
| 2 | 400 | 124 | 276 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 331200.0 | 1240000 | -908800.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 480000.0 | 0 | 480000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 480000.0 | 0 | 480000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 480000.0 | 0 | 480000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 480000.0 | 0 | 480000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 480000.0 | 0 | 480000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 480000.0 | 0 | 480000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 480000.0 | 0 | 480000.0 |
3.3.2 Under-Sampling ¶
irr = 0.12 # annual interest rate
sensitivity = gains_table_profit(y_test,best_model_u_probs)
selected_profit = sensitivity[sensitivity['profit']>0]['profit'].sum()
print('Total profit from the positive-profit deciles:', selected_profit)
sensitivity
Total profit from the positive-profit deciles: 3360000.0
| count | actual | non_actual | cum_count | cum_actual | cum_non_actual | percent_cum_actual | percent_cum_non_actual | if_random | lift | K_S | gain | revenue | cost | profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| decile | |||||||||||||||
| 0 | 400 | 316 | 84 | 400 | 316 | 84 | 0.42 | 0.03 | 74.6 | 4.24 | 39.0 | 79.00 | 100800.0 | 3160000 | -3059200.0 |
| 1 | 400 | 304 | 96 | 800 | 620 | 180 | 0.83 | 0.06 | 149.2 | 4.16 | 77.0 | 77.50 | 115200.0 | 3040000 | -2924800.0 |
| 2 | 400 | 126 | 274 | 1200 | 746 | 454 | 1.00 | 0.14 | 223.8 | 3.33 | 86.0 | 62.17 | 328800.0 | 1260000 | -931200.0 |
| 3 | 400 | 0 | 400 | 1600 | 746 | 854 | 1.00 | 0.26 | 298.4 | 2.50 | 74.0 | 46.62 | 480000.0 | 0 | 480000.0 |
| 4 | 400 | 0 | 400 | 2000 | 746 | 1254 | 1.00 | 0.39 | 373.0 | 2.00 | 61.0 | 37.30 | 480000.0 | 0 | 480000.0 |
| 5 | 400 | 0 | 400 | 2400 | 746 | 1654 | 1.00 | 0.51 | 447.6 | 1.67 | 49.0 | 31.08 | 480000.0 | 0 | 480000.0 |
| 6 | 400 | 0 | 400 | 2800 | 746 | 2054 | 1.00 | 0.63 | 522.2 | 1.43 | 37.0 | 26.64 | 480000.0 | 0 | 480000.0 |
| 7 | 400 | 0 | 400 | 3200 | 746 | 2454 | 1.00 | 0.75 | 596.8 | 1.25 | 25.0 | 23.31 | 480000.0 | 0 | 480000.0 |
| 8 | 400 | 0 | 400 | 3600 | 746 | 2854 | 1.00 | 0.88 | 671.4 | 1.11 | 12.0 | 20.72 | 480000.0 | 0 | 480000.0 |
| 9 | 400 | 0 | 400 | 4000 | 746 | 3254 | 1.00 | 1.00 | 746.0 | 1.00 | 0.0 | 18.65 | 480000.0 | 0 | 480000.0 |
Key observations:
- Both over-sampling and under-sampling approaches achieve the same total positive profit of 3,360,000 at a 12% interest rate, with identical decile-level profit distribution once profits turn positive.
- In both cases, positive profits begin at decile 3, while deciles 0–2 incur substantial losses caused by high costs from false positives.
- The magnitude of early-decile losses is nearly identical for both methods, with only slight differences in revenue amounts (e.g., decile 2 revenue is marginally higher in under-sampling).
- Lift and K-S statistics follow almost identical trajectories between the two methods, reflecting similar model ranking and discriminatory power.
- Given the nearly identical profit and performance patterns, the choice between over-sampling and under-sampling at 12% IRR would likely have negligible impact on final profit outcomes.
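Because deciles 3-9 contain no defaults, the total positive-decile profit scales linearly with the interest rate: 7 deciles × 400 loans × $10,000 × rate. A quick re-derivation of the three printed totals (an illustrative sketch, not the notebook's own loop):

```python
PRINCIPAL, LOANS_PER_DECILE, CLEAN_DECILES = 10_000, 400, 7  # from the gains tables

def positive_decile_profit(rate: float) -> float:
    """Total interest earned across the 7 default-free deciles."""
    return CLEAN_DECILES * LOANS_PER_DECILE * PRINCIPAL * rate

for rate in (0.04, 0.08, 0.12):
    print(f"rate={rate:.0%}: {positive_decile_profit(rate):,.0f}")
```

This reproduces the 1,120,000 / 2,240,000 / 3,360,000 totals of Sections 3.1-3.3.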
def plot_comparison(irr, over_profit, under_profit, prev_profit):
df_over = pd.DataFrame({'decile': list(range(10)), 'profit': over_profit})
df_under = pd.DataFrame({'decile': list(range(10)), 'profit': under_profit})
df_prev = pd.DataFrame({'decile': list(range(10)), 'profit': prev_profit})
df_over["cum_profit"] = df_over["profit"].cumsum()
df_under["cum_profit"] = df_under["profit"].cumsum()
df_prev["cum_profit"] = df_prev["profit"].cumsum()
fig, axs = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle(f"Profit Comparison by Decile — IRR = {irr*100:.0f}%", fontsize=16, fontweight='bold')
# Profit by Decile
axs[0].plot(df_over["decile"], df_over["profit"], marker='o', label='Over-Sampling', color='blue')
axs[0].plot(df_under["decile"], df_under["profit"], marker='s', label='Under-Sampling', color='orange')
axs[0].plot(df_prev["decile"], df_prev["profit"], marker='^', label='Previous H2O Model', color='green')
axs[0].axhline(0, color='black', linewidth=1, linestyle='--')
axs[0].set_title("Profit by Decile")
axs[0].set_xlabel("Decile")
axs[0].set_ylabel("Profit")
axs[0].legend()
axs[0].grid(True)
# Cumulative Profit
axs[1].plot(df_over["decile"], df_over["cum_profit"], marker='o', label='Over-Sampling', color='blue')
axs[1].plot(df_under["decile"], df_under["cum_profit"], marker='s', label='Under-Sampling', color='orange')
axs[1].plot(df_prev["decile"], df_prev["cum_profit"], marker='^', label='Previous H2O Model', color='green')
axs[1].axhline(0, color='black', linewidth=1, linestyle='--')
axs[1].set_title("Cumulative Profit by Decile")
axs[1].set_xlabel("Decile")
axs[1].set_ylabel("Cumulative Profit")
axs[1].legend()
axs[1].grid(True)
plt.tight_layout()
plt.show()
# ===== Example Call for IRR = 4% =====
# over/under decile profits taken from the Section 3.1 gains tables
plot_comparison(
    irr=0.04,
    over_profit=[-3126400, -3022400, -1129600, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    under_profit=[-3126400, -3001600, -1150400, 160000, 160000, 160000, 160000, 160000, 160000, 160000],
    prev_profit=[-3136800, -2980800, -1160800, 160000, 160000, 160000, 160000, 160000, 160000, 160000]
)
# ===== Example Call for IRR = 8% =====
# over/under decile profits taken from the Section 3.2 gains tables
plot_comparison(
    irr=0.08,
    over_profit=[-3092800, -2984800, -1019200, 320000, 320000, 320000, 320000, 320000, 320000, 320000],
    under_profit=[-3092800, -2963200, -1040800, 320000, 320000, 320000, 320000, 320000, 320000, 320000],
    prev_profit=[-3103600, -2941600, -1051600, 320000, 320000, 320000, 320000, 320000, 320000, 320000]
)
# ===== Example Call for IRR = 12% =====
plot_comparison(
irr=0.12,
over_profit=[-3059200, -2947200, -908800, 480000, 480000, 480000, 480000, 480000, 480000, 480000],
under_profit=[-3059200, -2924800, -931200, 480000, 480000, 480000, 480000, 480000, 480000, 480000],
prev_profit=[-3070400, -2902400, -942400, 480000, 480000, 480000, 480000, 480000, 480000, 480000]
)
- Overall trend: Across all three interest rates, the Previous H2O model consistently yields higher cumulative profit starting from decile 3 onward, while Over-Sampling and Under-Sampling produce nearly identical curves.
- IRR = 4%: All three methods incur heavy losses in the first three deciles, but the H2O model reaches break-even sooner and ends with the highest cumulative profit. Over- and Under-Sampling track almost exactly, indicating no meaningful difference between them.
- IRR = 8%: The profit gap between the H2O model and the sampling-based methods widens, especially in later deciles, as the H2O model's cumulative profit grows faster. Over- and Under-Sampling remain very close, with only marginal differences in the early deciles.
- IRR = 12%: The H2O model's advantage is most pronounced here, achieving the steepest cumulative profit growth after decile 2. Over- and Under-Sampling still overlap heavily, suggesting that the choice of model (H2O vs. sampling-based) matters more than the choice of sampling strategy.
4.2 Conclusions ¶
To evaluate profit performance for predicting mortgage default or fraud, we compared three approaches: Over-Sampling, Under-Sampling, and the previous best-performing H2O AutoML model. Each method was assessed using gains table analysis, lift and K-S statistics, and profit simulations under three interest rate scenarios (4%, 8%, and 12%).
The Previous H2O AutoML model consistently outperformed both Over-Sampling and Under-Sampling across all interest rates. At each IRR, it delivered higher cumulative profits starting from decile 3, maintained strong upward profit trajectories, and avoided the plateau effect seen in the sampling-based methods. Its cumulative profit gains were especially pronounced at higher interest rates, reaching the steepest growth curve at 12% IRR.
Over-Sampling and Under-Sampling produced nearly identical performance across all metrics and interest rates. Both incurred significant early losses in the first three deciles due to high costs from false positives, recovered at decile 3, and then plateaued with stable but lower profits than the H2O model. The differences between the two sampling methods were negligible, with only marginal variations in early-decile profits.
Profit comparison curves showed that while all methods improve as IRR increases, the gap between the H2O model and the sampling-based approaches widens significantly at higher rates, indicating that the model’s superior predictive power translates into greater business value when lending conditions are more favorable.
This analysis demonstrates the importance of evaluating models not only with classification metrics but also through business-driven performance simulations, as technical accuracy does not always guarantee optimal profit distribution across risk segments.
Final Thoughts: Although Over-Sampling and Under-Sampling deliver stable and nearly identical results, the Previous H2O AutoML model is clearly the most profitable and operationally effective choice. Its ability to generate higher profits across all IRRs, coupled with consistent performance gains in higher deciles, makes it the most practical and impactful model for deployment in a mortgage fraud or default detection setting. For future work, combining the H2O model with targeted cost-sensitive adjustments in early deciles could further reduce initial losses and maximize portfolio profitability.